Amendments to the Claims

This listing of claims will replace all prior versions of claims in the application: 
Listing of Claims: 

1. (Currently Amended) A system for summarizing audio information, comprising:
an analyzer to convert audio into frames; 

a fingerprinting component to convert the frames into fingerprints, each fingerprint based 
in part on a plurality of frames; 

a similarity detector to compute similarities between fingerprints, the similarity detector 
comprising a clustering function, the clustering function producing one or more sets of clusters 
of fingerprints based upon all fingerprints within a set of clusters meeting an initial threshold 
indicative of similarity; 

a heuristic module to generate a thumbnail of the audio file from a set of clusters that has 
at least two gaps between fingerprints, wherein a gap is a temporal space between two adjacent 
fingerprints that exceeds a predetermined threshold when fingerprints within a set of clusters are 
placed in sequential temporal order, the heuristic module comprising a flatness component in
order to determine a suitable segment of audio for the thumbnail, the flatness component
employs a number that is added to spectral magnitudes for each frequency component, to
mitigate numerical problems when determining logs.

2. (Currently Amended) The system of claim 1, the heuristic module comprising at least one
of an energy component and a flatness component in order to help determine a suitable segment
of audio for the thumbnail.

3. (Original) The system of claim 2, the heuristic module is employed to automatically 
select voiced choruses over instrumental portions. 

4. (Original) The system of claim 2, the energy component and the flatness component are 
employed when the fingerprints do not result in finding a suitable chorus. 

5. (Original) The system of claim 1, further comprising a component to remove silence at
the beginning and end of an audio clip via an energy-based threshold. 
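
By way of illustration only, the energy-based silence removal recited in claim 5 might be sketched as below. The function name `trim_silence`, the use of per-frame energies as input, and the threshold value in the usage line are assumptions made for this sketch; they are not taken from the application.

```python
def trim_silence(frame_energies, threshold):
    """Return (start, end) frame indices bounding the non-silent region.

    Hypothetical helper: frames whose energy is below `threshold` at the
    beginning and end of the clip are treated as silence and excluded.
    `end` is exclusive, so frame_energies[start:end] is the kept region.
    """
    loud = [i for i, e in enumerate(frame_energies) if e >= threshold]
    if not loud:
        # Every frame is below the threshold: nothing to keep.
        return (0, 0)
    return (loud[0], loud[-1] + 1)


start, end = trim_silence([0.0, 0.01, 0.5, 0.7, 0.02, 0.0], threshold=0.1)
```

Only leading and trailing frames are dropped; quiet frames between two loud frames stay inside the returned range.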

6. (Original) The system of claim 1, the fingerprint component further comprising a 
normalization component, such that an average Euclidean distance from each fingerprint to
other fingerprints for an audio clip is one. 

7. (Original) The system of claim 1, the analyzer computes a set of spectral magnitudes for
an audio frame. 

8. (Original) The system of claim 7, for each frame, a mean, normalized energy E is 
computed by dividing a mean energy per frequency component within the frame by the average 
of that quantity over frames in an audio file. 
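
The mean, normalized energy E of claim 8 could be computed along the following lines. This is a sketch under stated assumptions: the name `normalized_frame_energies` is hypothetical, each frame is given as a list of spectral magnitudes, and energy per frequency component is taken to be the squared magnitude, which the claim does not specify.

```python
def normalized_frame_energies(frames):
    """frames: list of frames, each a list of spectral magnitudes."""
    # Mean energy per frequency component within each frame.
    # Assumption: energy of a component is its squared magnitude.
    per_frame = [sum(m * m for m in frame) / len(frame) for frame in frames]
    # Average of that per-frame quantity over all frames in the file.
    file_mean = sum(per_frame) / len(per_frame)
    if file_mean == 0:
        # Entirely silent input: avoid dividing by zero.
        return [0.0] * len(per_frame)
    return [e / file_mean for e in per_frame]
```

With this normalization the values of E average to one over the file, so a fixed threshold on E behaves the same for loud and quiet recordings.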

9. (Original) The system of claim 8, further comprising a component that selects a middle 
portion of an audio file to mitigate quiet introduction and fades appearing in the audio file. 

10. (Cancelled). 

11. (Currently Amended) The system of claim 1 [[10]], the flatness component includes a
frame-quantity computed as a log normalized geometric mean of the spectral magnitudes. 

12. (Original) The system of claim 11, the normalization is performed by subtracting a
per-frame log arithmetic mean of the per-frame magnitudes from the geometric mean.
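
One possible reading of the flatness quantity in claims 1, 11 and 12 is sketched below: the log of the geometric mean of the spectral magnitudes minus the log of their arithmetic mean, with a small number added to each magnitude before taking logs. The function name and the default value of `eps` are assumptions; the claims do not give a value for the added number.

```python
import math


def log_spectral_flatness(magnitudes, eps=1e-10):
    """Log geometric mean minus log arithmetic mean of one frame's magnitudes."""
    # eps (assumed value) is added to every magnitude so log() never sees zero.
    shifted = [m + eps for m in magnitudes]
    # The log of the geometric mean equals the mean of the logs.
    log_geometric_mean = sum(math.log(x) for x in shifted) / len(shifted)
    log_arithmetic_mean = math.log(sum(shifted) / len(shifted))
    return log_geometric_mean - log_arithmetic_mean
```

Under this reading the value is near zero for a flat spectrum and increasingly negative for a peaked one, and it stays finite even when some magnitudes are exactly zero.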

13. (Previously Presented) The system of claim 1, the heuristic component selects the set of 
clusters from which to generate the audio thumbnail based upon at least one of a mean spectral 
quality value determined for the set of clusters or a cluster spread quality value determined for 
the set of clusters.

14. (Previously Presented) The system of claim 13, the heuristic component selects the set of
clusters that has the highest value for the sum of the square of the mean spectral quality value
determined for the set of clusters and the cluster spread quality value determined for the set of
clusters.

15. (Previously Presented) The system of claim 1, the initial threshold is a normalized 
Euclidean distance between two fingerprints.

16. (Previously Presented) The system of claim 1, wherein a cluster is a group of fingerprints 
in a set of clusters that lies between the same two gaps or lies between the beginning of the 
sequence of fingerprints and the first gap in the sequence or lies between the last gap in the 
sequence and the end of the sequence of fingerprints. 
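
The cluster definition of claim 16 can be illustrated with a short sketch. The name `split_into_clusters`, the representation of fingerprints by their time positions, and the threshold used in the usage line are assumptions for illustration only.

```python
def split_into_clusters(times, gap_threshold):
    """Group fingerprint time positions into clusters separated by gaps.

    A gap is a spacing between two temporally adjacent fingerprints that
    exceeds gap_threshold. Each returned list is one cluster.
    """
    ordered = sorted(times)
    clusters = [[ordered[0]]] if ordered else []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > gap_threshold:
            # Spacing exceeds the threshold: a gap, so start a new cluster.
            clusters.append([cur])
        else:
            clusters[-1].append(cur)
    return clusters


clusters = split_into_clusters([1, 2, 3, 20, 21, 50], gap_threshold=5)
num_gaps = len(clusters) - 1  # a set with at least two gaps has three or more clusters
```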

17. (Currently Amended) An automatic thumbnail generator, comprising: 
means for converting an audio file into frames; 

means for fingerprinting the audio file, producing fingerprints based in part on a plurality
of frames;

means for producing one or more sets of clusters of fingerprints based upon all 
fingerprints within a set of clusters meeting a predefined similarity threshold; and 

means for creating an audio thumbnail by selecting a set of clusters that has at least two 
gaps between fingerprints, wherein a gap is a temporal space between two adjacent fingerprints 
that exceeds a predetermined threshold when fingerprints within a set of clusters are placed in 
sequential temporal order, determine a suitable segment of audio for the thumbnail by at least
employing a number that is added to spectral magnitudes for each frequency component, to
mitigate numerical problems when determining logs.



10/785,560
MS305553.01/MSFTP561US

18. (Currently Amended) A method to generate audio thumbnails, comprising: 
generating a plurality of audio fingerprints, each audio fingerprint based in part on a
plurality of audio frames;

producing one or more sets of clusters of fingerprints based upon all fingerprints within a 
set of clusters meeting a similarity threshold; and 

creating a thumbnail based upon a set of clusters that has at least two gaps between
fingerprints, wherein a gap is a temporal space between two adjacent fingerprints that exceeds a
predetermined threshold when fingerprints within a set of clusters are placed in sequential
temporal order; and

clustering fingerprints within a set of clusters into fingerprint clusters based upon the
gaps;

determining a parameter (D) describing how evenly spread clusters are, temporally,
throughout an audio file, (D) is measured as follows:

normalizing a song to have duration of 1;
setting a time position of an i'th cluster to be $t_i$;

defining $t_0 = 0$ and $t_{N+1} = 1$; and

computed as $D = \frac{N+1}{N}\left(1 - \sum_{i=1}^{N+1} (t_i - t_{i-1})^2\right)$, where N is a number of clusters in a
cluster set;

selecting the set of clusters from which to generate the audio thumbnail based upon at 
least parameter (D).
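
A spread measure of the kind recited in claim 18 might be sketched as follows. The sketch assumes that (D) is one minus the sum of squared spacings between consecutive cluster positions (with 0 and 1 added as end points), scaled by (N+1)/N so that its maximum is 1 and its minimum is 0 for any N. The function name `cluster_spread` and its arguments are hypothetical.

```python
def cluster_spread(cluster_times, duration):
    """Spread measure D for N cluster time positions in a song of given duration."""
    n = len(cluster_times)
    if n == 0:
        raise ValueError("at least one cluster is required")
    # Normalize the song to duration 1 and add the end points t_0 = 0, t_{N+1} = 1.
    t = [0.0] + sorted(x / duration for x in cluster_times) + [1.0]
    # Sum of squared spacings between consecutive positions.
    sum_sq = sum((b - a) ** 2 for a, b in zip(t, t[1:]))
    # Assumed scaling (N+1)/N maps evenly spread clusters to D = 1.
    return (n + 1) / n * (1.0 - sum_sq)


d_even = cluster_spread([25.0, 50.0, 75.0], duration=100.0)
d_bunched = cluster_spread([0.0, 0.0, 0.0], duration=100.0)
```

Evenly spaced clusters give the largest value, while clusters bunched at one position give the smallest, which is the behavior a selection step based on (D) would rely on.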

19. (Cancelled) 

20. (Original) The method of claim 18, the similarity threshold is a normalized Euclidean
distance between two fingerprints.

21. (Original) The method of claim 18, the similarity threshold chosen adaptively based upon
the audio file and used to help determine if two fingerprints belong to the same cluster set. 

22. (Currently Amended) The method of claim 18 [[19]], the clustering operating by
considering one fingerprint at a time. 
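
One way to picture clustering that considers one fingerprint at a time is the greedy sketch below. It is an assumption-laden illustration, not the claimed procedure: the function name is hypothetical, fingerprints are plain coordinate tuples, and a fingerprint joins the first existing set in which it is within the threshold of every member.

```python
import math


def assign_to_cluster_sets(fingerprints, threshold):
    """Greedy, one-at-a-time grouping of fingerprints into sets."""
    sets = []
    for fp in fingerprints:
        for members in sets:
            # Join a set only if fp is within the threshold of all its members,
            # so every pair inside a set meets the similarity threshold.
            if all(math.dist(fp, other) <= threshold for other in members):
                members.append(fp)
                break
        else:
            # No existing set accepted this fingerprint: start a new one.
            sets.append([fp])
    return sets
```

Because each fingerprint is compared only with sets built so far, the result can depend on the order in which fingerprints are presented.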

23-25. (Cancelled) 

26. (Currently Amended) The method of claim 18 [[25]], further comprising determining an
offset and scaling factor so that (D) takes a maximum value of 1 and minimum value of 0, for
any N.

27. (Currently Amended) The method of claim 18 [[25]], further comprising determining a 
mean spectral quality for fingerprints in a set. 

28. (Original) The method of claim 27, wherein a mean spectral flatness for a set, and a 
parameter D, are combined to determine a best cluster set from among a plurality of cluster sets. 

29. (Original) The method of claim 28, the mean spectral flatness and parameter D are 
combined into a single parameter associated with each cluster set, such that the set with the 
extremal value of the parameter is selected to be the best set.

30. (Original) The method of claim 29, when the best cluster set is selected, a best fingerprint 
within the cluster set is determined as the fingerprint in which surrounding audio, of duration 
about equal to a duration of an audio thumbnail, has maximum spectral energy or flatness. 

31. (Original) The method of claim 18, the creating further comprising determining a cluster
by determining a longest section of audio within an audio file that repeats in the audio file. 

32. (Original) The method of claim 18, the creating further comprising at least one of:
rejecting clusters that are close to a beginning or end of a song; 

rejecting clusters for which energy falls below a threshold for any fingerprint in a 
predetermined window; and 

selecting a fingerprint having a highest average spectral flatness measure in the 
predetermined window. 

33. (Original) The method of claim 18, the creating further comprising generating a
thumbnail by specifying time offsets in an audio file. 

34. (Original) The method of claim 18, the creating further comprising automatically fading a
beginning or an end of an audio thumbnail. 
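
The automatic fading of claim 34 could, for example, be a linear gain ramp applied to the thumbnail samples. The sketch below assumes a linear ramp and fades both ends; the function name, the sample representation as a list of floats, and the ramp shape are not taken from the application.

```python
def apply_fades(samples, fade_len):
    """Return a copy of samples with a linear fade-in and fade-out."""
    n = len(samples)
    # Never let the two ramps overlap on a very short thumbnail.
    fade_len = min(fade_len, n // 2)
    out = list(samples)
    for i in range(fade_len):
        gain = i / fade_len  # rises from 0 toward 1 across the ramp
        out[i] *= gain           # fade in at the beginning
        out[n - 1 - i] *= gain   # mirrored fade out at the end
    return out
```

A fade avoids the audible click that an abrupt cut into the middle of a song would otherwise produce.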

35. (Original) The method of claim 18, the generating further comprising processing an audio
file in at least two layers, where the output of a first layer is based on a log spectrum computed 
over a small window and a second layer operates on a vector computed by aggregating vectors 
produced by the first layer. 

36. (Original) The method of claim 35, further comprising providing a wider temporal 
window in a subsequent layer than a preceding layer.



37. (Original) The method of claim 36, further comprising employing at least one of the 
layers to compensate for time misalignment. 



